Introduction to Principal Component Analysis

What is PCA?


PCA (Principal Component Analysis) is an unsupervised machine learning technique that transforms a dataset into a set of uncorrelated principal components. It can reveal strong patterns and sources of variance that are difficult to see in the raw data. Explore the tabs below to see how PCA can be used with different datasets.


Sections


PCA Overview

What is PCA used for?

When working with massive datasets with thousands of columns, it can be useful to transform the data into weighted combinations (principal components) that explain the majority of the variation in the data. Some reasons to use PCA are:

  • Dimension Reduction
  • Computational Savings
  • Visualizations
  • Interpretation of components instead of features



Think of PCA as a change of coordinates for each point. Shown below is a subset of a dataset.

datatable(head(exampleDF), style = 'bootstrap', options = list(dom = 't'))


Our goal here is to rewrite this data as linear combinations of the principal components. These components give us a new coordinate system, which is useful for dimension reduction. Think of our dataset \(X\) as an \(N \times d\) matrix that we wish to transform into an \(N \times m\) matrix: \[ X \in \mathbb{R}^{N \times d} \Longrightarrow X^{*} \in \mathbb{R}^{N \times m} \] By reducing \(d\) columns to \(m\) columns we eliminate columns that are strong linear combinations of one another. To do this we must find the covariance matrix of \(X\). Before computing the covariance matrix we need to standardize the data. Once that is done, solve for the \(d \times d\) covariance matrix \(C\):

\[ \begin{align} C &= \frac{1}{N - 1}X^{T}X \\ C_{i,j} &= \frac{1}{N - 1}\sum^N_{z=1}X_{z,i} X_{z,j} \end{align} \] The covariance matrix \(C\) holds the variance of variable \(i\) on its diagonal, \(C_{i,i}\), and the covariance between variables \(i\) and \(j\) in \(C_{i,j} = C_{j,i}\). Next we calculate the eigenvalues \(\lambda_i\) of the covariance matrix, which are the solutions to the equation \[ \det(C - \lambda I) = 0 \] Once we have the eigenvalues, we select the \(m\) eigenvectors that correspond to the \(m\) largest eigenvalues: \[ \{\lambda_1, \lambda_2, \cdots, \lambda_m\} \qquad \lambda_i \geq \lambda_j \quad \forall \quad i<j \] To decide how many eigenvectors to keep and how many to drop, we want the eigenvectors that explain the most variance of the data with the fewest columns. Each eigenvalue is the variance along its eigenvector, so divide each eigenvalue by the sum of all eigenvalues to get the proportion of variance that component explains.
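The steps above (standardize, covariance, eigen-decomposition, selection of the top \(m\) eigenvectors) can be sketched end-to-end in a few lines of base R. The toy matrix `X` below is invented for illustration; any numeric matrix works:

```r
# toy data: 100 observations, 4 variables, with column 2 nearly a copy of column 1
set.seed(1)
X <- matrix(rnorm(400), ncol = 4)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)

# standardize the data (center and scale each column)
Xs <- scale(X)

# d x d covariance matrix: C = X^T X / (N - 1)
C <- t(Xs) %*% Xs / (nrow(Xs) - 1)

# eigen() returns eigenvalues in decreasing order for a symmetric matrix
eig <- eigen(C)

# keep the m eigenvectors with the largest eigenvalues and project onto them
m <- 2
Xstar <- Xs %*% eig$vectors[, 1:m]   # the reduced N x m data
```

Because column 2 is almost a linear copy of column 1, the first eigenvalue absorbs most of the variance, and the projection down to \(m = 2\) columns loses very little information.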

Let's visualize the data (centered so \(\mu = 0\)) in 3 dimensions. Try to spot the axis of greatest variation.

library(plotly)
# centering with 'scale()'
center_scale <- function(x) {
    scale(x, scale = FALSE)
}
#converting to a dataframe
tempDF <- center_scale(irisDF[c(gx, gy, gz)])
tempDF <- data.frame(tempDF)

#plotting in 3 dimensions (x,y,z)
p <- plot_ly(tempDF, x = px, y = py, z = pz, marker = list(size = 5), height = 600) %>%
  add_markers() %>%
  #specifying the layout of the graph and default view
  layout(scene = list(xaxis = list(title = 'x'),
                     yaxis = list(title = 'y'),
                     zaxis = list(title = 'z')
                     )
         )

p



We need to find which eigenvectors of the covariance matrix explain the variation in the data. After subtracting each column mean \(\mu\) from the data (centering it), we can compute the covariance matrix. Here are the eigenvalues and eigenvectors for the subset of Iris data.

# Finding the covariance matrix of our data (leaving out the 'Species' column)
covdf <- cov(centeredIris[,1:3])

# Finding the eigenvalues and eigenvectors
eigenInfo <- eigen(covdf)

# Eigen-Values
eigenInfo$values
## [1] 3.69638110 0.15685442 0.03402802
# Eigen-Vectors
eigenInfo$vectors
##             [,1]       [,2]       [,3]
## [1,]  0.38539668 0.15645943  0.9093898
## [2,] -0.09961563 0.98681521 -0.1275636
## [3,]  0.91735823 0.04142686 -0.3959011



Observe the eigenvalues that we found. The total variance, \(T\), is their cumulative sum. \[ \begin{align} T &= \sum_{i = 1}^n \lambda_i \\ &= \lambda_1 + \lambda_2 + \lambda_3 \\ &= 3.69638110 + 0.15685442 + 0.03402802 \\ &= 3.88726354 \end{align} \] To determine the variation explained by each principal component we simply divide \(\lambda_i\) by \(T\). Computing the explained variation for each eigenvalue yields the following: \[ \frac{\lambda_1}{T} = \frac{3.6964}{3.8873} = 0.9509 \qquad \frac{\lambda_2}{T} = \frac{0.1569}{3.8873} = 0.0404 \qquad \frac{\lambda_3}{T} = \frac{0.0340}{3.8873} = 0.0088 \] The principal component of \(\lambda_1\) explains 95.1% of the variance of the data, while \(\lambda_2\) and \(\lambda_3\) explain 4.0% and 0.9% respectively. This means that by using principal component 1 and principal component 2 together we can explain 99.1% of the variation of the data.
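Rather than plugging the numbers in by hand, the same proportions can be computed directly from the eigenvalues in R:

```r
# eigenvalues of the covariance matrix found above
lambda <- c(3.69638110, 0.15685442, 0.03402802)

# proportion of variance explained by each component
explained <- lambda / sum(lambda)
round(explained, 4)          # 0.9509 0.0404 0.0088

# cumulative proportion: PC1 and PC2 together already explain 99.1%
round(cumsum(explained), 4)  # 0.9509 0.9912 1.0000
```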


Let’s plot the eigenvectors alongside the data and see if our PCA components agree.

From the graph you can see that the first eigenvector explains most of the variation in the data, while the other two eigenvectors add far less information. Let's transform the data to use PC1 and PC2 as the two axes.


To do this we multiply our centered data matrix by \(P\), the matrix whose columns are the eigenvectors sorted by the magnitude of their eigenvalues. We will call the product \(X^{*}\); it is our data expressed in the eigenvector coordinates.
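As a sanity check, the manual product \(X^{*} = XP\) reproduces the scores that R's built-in `prcomp()` computes internally, up to the arbitrary sign of each eigenvector. This sketch uses the built-in `iris` data with the centered-but-unscaled convention from the derivation above:

```r
# center the data (no scaling) and build P from the eigenvectors of cov(X)
X <- scale(as.matrix(iris[, 1:3]), scale = FALSE)
P <- eigen(cov(X))$vectors

# manual projection: X* = X P
Xstar <- X %*% P

# prcomp() performs the equivalent computation (via the SVD)
pc <- prcomp(iris[, 1:3], center = TRUE, scale. = FALSE)

# agreement up to eigenvector sign: this difference is effectively zero
max(abs(abs(Xstar) - abs(pc$x)))
```

The sign ambiguity is harmless: flipping an eigenvector flips the corresponding axis, but the geometry of the projected points is unchanged.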

library("FactoMineR")
pca <- PCA(iris[,1:3], scale.unit=TRUE, graph=FALSE)
scores <- data.frame(pca$ind$coord)

#scores$Species = iris[,c("Species")]

p <- plot_ly(scores, x = ~Dim.1, y = ~Dim.2, marker = list(size = 8), height = 600) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                     yaxis = list(title = 'PC2')))

p


It may look like we dropped an entire column of data from the dataset, but 99.1% of the original variation is still there for analysis. By transforming the data into linear combinations of the principal components, we can discard components that only explain a small percentage of variation in the data. This is useful for large data since it allows dimensionality reduction and, in turn, a decrease in computation time.


We can use this reduced data for further analysis. Suppose we wish to classify the data into definitive groups. One such machine learning algorithm is K-means: we can use it to assign each point to a cluster, and then further analyze the similarities within each cluster.

library("FactoMineR")
pca <- PCA(iris[,1:3], scale.unit=TRUE, graph=FALSE)
scores <- data.frame(pca$ind$coord)

k <- kmeans(scores, centers = 3)
scores$Cluster = k$cluster

p <- plot_ly(scores, x = ~Dim.1, y = ~Dim.2, color = ~Cluster, marker = list(size = 8), height = 600) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                     yaxis = list(title = 'PC2'))) %>%
  hide_colorbar()

p

Observing the graph we can see 3 definitive clusters. The dataset is actually the Iris dataset, so we can check whether our analysis correctly classified the species of flower.

library("FactoMineR")
pca <- PCA(iris[,1:3], scale.unit=TRUE, graph=FALSE)
scores <- data.frame(pca$ind$coord)

k <- kmeans(scores, centers = 3)
scores$Cluster = k$cluster
scores$Species = as.numeric(factor(iris$Species))

p <- plot_ly(scores, x = ~Dim.1, y = ~Dim.2, color = ~Cluster, marker = list(symbol = ~Species, size = 8), height = 600) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2')))

p

From this we can see that our clustering captured almost all of the species correctly. PCA was able to reduce the data to only 2 variables and still explain the vast majority of the variation in the dataset. PCA is a great tool for guiding data analysis: using it to visualize and understand the differences in the data allows statisticians to analyze the data further with underlying knowledge of the components.

Weekly Food Consumption

Oftentimes the information stored within a dataset is hard to see at first glance. The data we are interested in contains the average consumption of 17 types of food, in grams per person per week, for four countries.

datatable(head(ukdf2), style = 'bootstrap', options = list(dom = 't'))

Looking at raw numbers can only help so much; let's create a grouped barplot to represent every category for each country.

p <- ggplot(ukdf, aes(factor(Category), Amount, fill = Country)) + 
  geom_bar(stat="identity", position = "dodge") + 
  scale_fill_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(p)

From this graph most of the countries look relatively similar. We could analyze specific categories if we wanted to, but a better approach would be to use PCA so that the data can be captured with fewer than 17 columns. To start we need to find the covariance matrix for our data.

#using the same dataset mutated so that the columns are the countries
rotated <- read.csv("uk2.csv")

#transpose of the data set for PCA, converted back to a data frame
rotated <- data.frame(t(rotated[c(1:17), c(2:5)]))

#function to center the data
center_scale <- function(x) {
  scale(x, scale = FALSE)
}

#apply the center_scale function
rotated <- data.frame(center_scale(rotated[c(colnames(rotated))]))

#computing the covariance matrix
coMat <- cov(rotated)
datatable(coMat[1:5,1:5], style = 'bootstrap', options = list(dom = 't'))

Shown above are only 5 of the 17 columns of our covariance matrix. Now that we have the covariance matrix, we need to compute its eigenvalues and eigenvectors to get our components. Computing them in R is easy.

\[ \Large \text{Eigenvalues} \]

#getting the eigen information
e <- eigen(coMat)

#saving the values as a dataframe for easier viewing
edf <- data.frame(e$values)
#naming the rows
rownames(edf) <- seq(1,17,1)

#rendering the table
edf[1:6,] %>%
  kable(col.names = "Magnitude") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Magnitude
108504.971
45993.722
7145.557
0.000
0.000
0.000

\[ \Large \text{Eigenvectors} \]

#saving the vectors as a dataframe for easier viewing
evdf <- data.frame(e$vectors)
#naming the rows and columns
colnames(evdf) <- c("EV1","EV2","EV3","EV4","EV5","EV6","EV7","EV8","EV9","EV10","EV11","EV12","EV13","EV14","EV15","EV16","EV17")
rownames(evdf) <- seq(1,17,1)
#rendering the table
evdf[,1:6] %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
EV1 EV2 EV3 EV4 EV5 EV6
-0.4558165 0.1363311 -0.4029357 0.5945762 0.0000000 0.0000000
-0.0268097 -0.0290869 -0.0307685 0.6120113 0.0268932 0.0503393
0.0478535 0.0114693 0.0498089 0.0905221 -0.1652756 -0.0926986
-0.0547470 -0.2119961 -0.2886565 -0.1504401 0.7067578 0.3490614
-0.0554048 0.0198675 0.0246793 -0.0234021 -0.0307836 0.1446307
-0.0292834 0.0071605 -0.0434882 -0.0344617 0.0018091 0.0497705
-0.0084246 -0.0957200 -0.0997785 -0.0163238 0.1041396 -0.3789877
-0.0841219 -0.0442831 0.0467818 0.0006954 -0.3433819 -0.0229229
-0.6243112 -0.1298216 0.4250752 -0.0236682 0.1868294 -0.0309901
0.3738770 -0.7368953 -0.1498219 0.2552870 -0.0471391 -0.0218230
-0.1519486 -0.1309302 0.2145775 0.0250026 -0.1281214 0.0193053
-0.3159671 -0.1212518 -0.6556202 -0.3871953 -0.3511945 -0.1180323
-0.2458587 -0.2081178 -0.0024745 -0.1389245 0.1729363 -0.1359751
-0.0256868 0.0432584 -0.0658729 -0.0444859 -0.1276670 0.6168411
-0.0367924 -0.0419739 0.0538921 -0.0010112 -0.2744130 0.5265419
0.2423788 0.5325557 -0.2222849 0.0306888 0.1971996 -0.0485380
-0.0383551 -0.0406179 -0.0238925 -0.0243678 -0.0739668 0.0717726

Now that we have all of the eigen-information, we can decide which components to drop. The magnitude of each eigenvalue represents the variance explained by the corresponding component. This means we divide each eigenvalue by the sum of all eigenvalues in order to find the percentage of variation explained by each component.

#function to divide by total variance
div <- function(x) {
  x / totV
}

#finding the total variance (sum of the eigenvalues)
totV <- sum(e$values)
percent <- data.frame(sapply(e$values, div))
rownames(percent) <- c("EV1","EV2","EV3","EV4","EV5","EV6","EV7","EV8","EV9","EV10","EV11","EV12","EV13","EV14","EV15","EV16","EV17")

percent %>%
  kable(col.names = "Percentage Explained") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Percentage Explained
EV1 0.6712578
EV2 0.2845367
EV3 0.0442055
EV4 0.0000000
EV5 0.0000000
EV6 0.0000000
EV7 0.0000000
EV8 0.0000000
EV9 0.0000000
EV10 0.0000000
EV11 0.0000000
EV12 0.0000000
EV13 0.0000000
EV14 0.0000000
EV15 0.0000000
EV16 0.0000000
EV17 0.0000000

From this calculation we can see that the first 3 eigenvalues account for essentially 100% of the variation in the data. This means that we can drop the other 14 components. A common rule of thumb (the Kaiser criterion, usually applied when PCA is run on the correlation matrix) is to drop components whose eigenvalues are less than 1. After dropping those components we are left with the following eigenvalues and eigenvectors.
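An alternative to the eigenvalue-greater-than-1 rule is to keep the smallest number of components that reach a target share of variance, say 95%. A quick sketch using the eigenvalues computed above:

```r
# the three nonzero eigenvalues from the food data (the other 14 are zero)
lambda <- c(108504.971, 45993.722, 7145.557, rep(0, 14))

# cumulative proportion of variance explained
cumvar <- cumsum(lambda) / sum(lambda)

# smallest m that crosses the 95% threshold
m <- which(cumvar >= 0.95)[1]
m   # 2: two components already explain over 95% of the variance
```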

\[ \Large \text{Usable Principal Components} \]

#the first three usable eigenvectors (the first three columns of e$vectors)
eVecs <- data.frame(e$vectors[, 1:3])
eVecs %>%
  kable(col.names = c("EV1", "EV2", "EV3")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
EV1 EV2 EV3
-0.4558165 0.1363311 -0.4029357
-0.0268097 -0.0290869 -0.0307685
0.0478535 0.0114693 0.0498089
-0.0547470 -0.2119961 -0.2886565
-0.0554048 0.0198675 0.0246793
-0.0292834 0.0071605 -0.0434882
-0.0084246 -0.0957200 -0.0997785
-0.0841219 -0.0442831 0.0467818
-0.6243112 -0.1298216 0.4250752
0.3738770 -0.7368953 -0.1498219
-0.1519486 -0.1309302 0.2145775
-0.3159671 -0.1212518 -0.6556202
-0.2458587 -0.2081178 -0.0024745
-0.0256868 0.0432584 -0.0658729
-0.0367924 -0.0419739 0.0538921
0.2423788 0.5325557 -0.2222849
-0.0383551 -0.0406179 -0.0238925

Recall that the first component we found contains 67.1% of the variation in the data and the second contains 28.5%. With these two components we preserve 95.6% of the useful information in the data, without the other 15 columns. Let's first plot the data in one dimension to see if anything jumps out.

#translating the original coordinates onto the first principal component

#getting the first eigenvector (the first column of e$vectors)
dimOne <- e$vectors[, 1, drop = FALSE]

#converting rotated to a matrix for multiplication
matrixData <- data.matrix(rotated)

#project by multiplying the data by the eigenvector
newCoords <- data.frame(matrixData %*% dimOne)

#adding column names
colnames(newCoords) <- c("Dim.1")

datatable(newCoords, style = 'bootstrap', options = list(dom = 't'))

Now that we have translated our original data points onto the axis of our first principal component, let's plot it.

#adding the y-coordinate so we can plot it
newCoords$Y <- rep(0,4)

#naming the columns and rows
colnames(newCoords) <- c("Dim.1", "Y")

#plotting with the new coords 
p <- plot_ly(newCoords, x = ~Dim.1, y = ~Y, text = rownames(newCoords), mode = 'markers', height = 600) %>%
  add_markers(marker = list(size = 8)) %>%
  add_text(textposition = "top left") %>%
  layout(
    xaxis = list(title = 'PC1', range = c(-100, 120)), 
    yaxis = list(title = 'Y', range = c(-.1,.1))
  )
p

Right away we can see that Northern Ireland seems to be the furthest away from all the other countries. Let's see if this trend persists with 2 dimensions.

#translating the original coordinates onto the first two principal components

#getting the first two eigenvectors (the first two columns of e$vectors)
dimTwo <- e$vectors[, 1:2]

#converting rotated to a matrix for multiplication
matrixData <- data.matrix(rotated)

#project by multiplying the data by the eigenvectors
newCoords <- data.frame(matrixData %*% dimTwo)

#naming the columns
colnames(newCoords) <- c("Dim.1", "Dim.2")

datatable(newCoords, style = 'bootstrap', options = list(dom = 't'))

We have now successfully projected our original 17-dimensional data onto just 2 dimensions that represent 95.6% of the variation in the data. Plotting below:

#plotting with the new coords 
p <- plot_ly(newCoords, x = ~Dim.1, y = ~Dim.2, text = rownames(newCoords), mode = 'markers', height = 600) %>%
  add_markers(marker = list(size = 8)) %>%
  add_text(textposition = "top") %>%
  layout(
    xaxis = list(title = 'PC1', range = c(-90, 120)), 
    yaxis = list(title = 'PC2')
  )
p

Looking at the graph we can see that Northern Ireland is still one of the furthest data points from the other countries. Along the first principal component, Wales, Scotland, and England are all relatively close to each other. After introducing the second principal component it becomes evident that Wales is the outlier along that axis. One last visualization uses three principal components; with only 3 principal components we can explain essentially 100% of the variation in the data.

#translating the original coordinates onto the first three principal components

#getting the first three eigenvectors (the first three columns of e$vectors)
dimThree <- e$vectors[, 1:3]

#converting rotated to a matrix for multiplication
matrixData <- data.matrix(rotated)

#project by multiplying the data by the eigenvectors
newCoords <- data.frame(matrixData %*% dimThree)

#naming the columns
colnames(newCoords) <- c("Dim.1", "Dim.2", "Dim.3")

#plotting with the new coords (3D, so the axis titles go inside 'scene')
p <- plot_ly(newCoords, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, text = rownames(newCoords), mode = 'markers', height = 600) %>%
  add_markers(marker = list(size = 8)) %>%
  add_text(textposition = "top") %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'),
                      zaxis = list(title = 'PC3')))
p

Now that we know which countries are outliers on each principal component, we can dissect the principal components to see the loadings with respect to the original column variables. Instead of doing this by hand, we will take advantage of the built-in functions in R. Shown below are the loadings of the first 3 principal components.

PCA <- prcomp(rotated, scale = F)
rownames(PCA$rotation) <- ukdf2$Category
datatable(PCA$rotation[,1:3], style = 'bootstrap', options = list(dom = 't', pageLength = 17))

Straight away something about PC1 stands out. The only positive loadings are from the categories Carcase Meat, Fresh Fruit, Fresh Potatoes, and Soft Drinks. This implies that all the other categories are negatively correlated with those four: countries that consume more Carcase Meat, Fresh Fruit, Fresh Potatoes, or Soft Drinks tend to consume less of everything else. Let's try to visualize this with a biplot to see if our intuitions are correct.

library(ggbiplot)
ggbiplot(PCA, cex = 0.5)

This biplot shows the projection of each food category onto the first two principal components. We can see Fresh Potatoes and Soft Drinks standing out from the rest of the categories; these two categories look like they pushed Northern Ireland and Scotland further away. Let's plot another grouped barchart for those two categories.

#keep only the rows for the Fresh Potatoes and Soft Drinks categories
df <- ukdf[(ukdf$Category == "Fresh Potatoes" | ukdf$Category == "Soft Drinks"),]
p <- ggplot(df, aes(factor(Category), Amount, fill = Country)) + 
  geom_bar(stat="identity", position = "dodge") + 
  scale_fill_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(p)


Bank Data

Semantic Analysis

PCA can be used in many different applications, even text analysis! We are using the dataset 'allWords.csv', which contains word counts scraped from the CNNHealth and FoxHealth Twitter accounts. Take a look at this dataset below.

# Read in the data
dfwords <- read.csv("allWords.csv")

# Display the data
datatable(head(dfwords), style = 'bootstrap', options = list(dom = 't'))

The words are split up and have a total count as well as a count for each source. Recall from the PCA overview section that we need this data in the form of a covariance matrix, so we need a technique to convert these words into numbers by giving each one a "score". In this case, we will use Term Frequency-Inverse Document Frequency, or TF-IDF. \[ \text{Define:} \] \[ \text{TF} = \frac{\text{number of times term } t \text{ appears in a document}}{\text{total number of terms in the document}} \] \[ \text{IDF} = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right) \] \[ \text{TF-IDF} = \text{TF} \times \text{IDF} \] Now that we have the basics of TF-IDF down, we can use some libraries in R for easy data transformation and TF-IDF.
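Before handing the work to a library, a tiny hand-rolled example (the two "documents" below are invented for illustration) shows the formulas in action:

```r
# two toy documents, one per source
docs <- list(cnn = c("flu", "vaccine", "flu"),
             fox = c("flu", "diet"))

# TF: how often "vaccine" appears in the cnn document
tf <- sum(docs$cnn == "vaccine") / length(docs$cnn)   # 1/3

# IDF: "vaccine" appears in 1 of the 2 documents
idf <- log(length(docs) / sum(sapply(docs, function(d) "vaccine" %in% d)))

# TF-IDF score for "vaccine" in the cnn document
tfidf <- tf * idf   # (1/3) * log(2), roughly 0.231
```

Note that a word like "flu" that appears in every document gets \(\text{IDF} = \log_e(2/2) = 0\), so its TF-IDF score is 0 no matter how frequent it is; this is exactly how common stop words drop out.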

#we are using the libraries "tidytext" and "tidyverse"
library(tidytext)
library(tidyverse)

# Converting the dataframe into a tibble for easy manipulation
allWords <- as_tibble(dfwords)

#Starting with a quick visualization to see total word frequency
ggplot(allWords, aes(n/total, fill = Source)) +
  geom_histogram(show.legend = FALSE, binwidth = 0.1) +
  facet_wrap(~Source, ncol = 2, scales = "free_y") +
  xlab("Term Frequency")

The above plot showcases what term frequency is: the number of times a word appears in a source divided by the total number of words from that source. From here we can see that fox tends to have relatively more words with medium term frequency than cnn. The far right bar most likely corresponds to words like "as", "a", "the", or "and". When we run TF-IDF, these terms will go to 0 and be ignored during the final step. Now we can run TF-IDF and get some results.

# Running TDIDF from the tidytext library
wordScores <- allWords %>%
  bind_tf_idf(Word, Source, n)

remove <- c("aug", "nov", "jul", "sep", "sug", "oct",
          "jun", "there", "any", "because", "let",
          "where", "est", "thats", "youve", "tells", "theyre",
          "lot", "sure")
idx <- which(!wordScores$Word %in% remove)
wordScores <- wordScores[c(idx),]
datatable(head(wordScores), style = 'bootstrap', options = list(dom = 't'))

Now that we have TF-IDF scores, let's sort them and visualize the top 20 words for each news source.

#Sorting and grouping to get the top 20 words
wordScores %>% 
  arrange(desc(tf_idf)) %>%
  mutate(Word = factor(Word, levels = rev(unique(Word)))) %>% 
  group_by(Source) %>% 
  top_n(20) %>% 
  ungroup %>%
  ggplot(aes(Word, tf_idf, fill = Source)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~Source, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf

From this comparison we can see a drastic difference in words from both news sources. Let’s set up our data for PCA. Recall that PCA involves the eigen-decomposition of a covariance matrix. So firstly we need to create a covariance matrix out of our existing data.

#For now we will only worry about the tf_idf score,
#so we drop all rows whose score is 0

# apply() returns TRUE for rows where tf_idf is nonzero;
# c("Word", "tf_idf", "Source") selects the columns to keep
pcaText <- wordScores[
  apply(wordScores["tf_idf"], 1, function(row) all(row != 0)), c("Word", "tf_idf", "Source")
  ]
grouped <- pcaText %>% 
  group_by(Source)
grouped
## # A tibble: 4,471 x 3
## # Groups:   Source [2]
##    Word          tf_idf Source
##    <fct>          <dbl> <fct> 
##  1 aaron      0.0000231 cnn   
##  2 abandoning 0.0000571 fox   
##  3 abbvie     0.0000571 fox   
##  4 abdomen    0.000114  fox   
##  5 abduction  0.0000463 cnn   
##  6 abiltity   0.0000231 cnn   
##  7 able       0.000255  cnn   
##  8 abort      0.0000231 cnn   
##  9 abortion   0.000116  cnn   
## 10 abroad     0.0000231 cnn   
## # … with 4,461 more rows
cor(grouped[,c("tf_idf")])
##        tf_idf
## tf_idf      1
datatable(head(pcaText), style = 'bootstrap', options = list(dom = 't'))

This data currently has 4,471 rows, which means its covariance matrix would be 4,471 \(\times\) 4,471. Let's see if we can gain any knowledge from it.